This R Markdown document contains the code used to conduct the musical analysis of two psychedelic compounds, Ayahuasca and LSD. It consists of 3 parts: 1. Data Preprocessing 2. Exploratory data visualisations 3. Building 2 classifiers to test how well musical features can differentiate tracks between the two psychedelic cultures.
The script setup:
#Packages needed
#install.packages("dplyr", "tidyverse", "pacman")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.3
## ✓ tibble 3.0.4 ✓ stringr 1.4.0
## ✓ tidyr 1.0.2 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(pacman)
#Step One: Filter the data
#The data has already been grouped into Drugs from the keyword searches (column = "Drug") so we can subset using this column
data_Ayahuasca <- CDS_data %>% subset(Drug == "Ayahuasca")
data_LSD <- CDS_data %>% subset(Drug == "LSD")
#Step Two: Calculate the within Drug Weighted Index (TF-IDF equation)
#i) create a drug duplicates column, ii) Calculate new TF term, iii) calculate the new index
data_Ayahuasca <- data_Ayahuasca %>% group_by(TrackName, Artists) %>% mutate(Drug_Dups = n()) #i
data_Ayahuasca <- data_Ayahuasca %>% group_by(TrackName, Artists) %>% mutate(Drug_TF = sum(FollowCount / NumOfTracks)) # ii
data_Ayahuasca <- data_Ayahuasca %>% mutate(Drug_Index = Drug_TF * log10(1 + Inverse)) #iii
data_LSD <- data_LSD %>% group_by(TrackName, Artists) %>% mutate(Drug_Dups = n()) #i
data_LSD <- data_LSD %>% group_by(TrackName, Artists) %>% mutate(Drug_TF = sum(FollowCount / NumOfTracks)) #ii
data_LSD <- data_LSD %>% mutate(Drug_Index = Drug_TF * log10(1 + Inverse)) #iii
#Now we have our ayahuasca data in a data frame with a column for duplicates within compound (Drug_Dups), a column for within compound TF (Drug_TF), and a new within compound index (Drug_Index)
#Step Three: Remove the duplicates within the compound groups
data_LSD <- data_LSD %>% distinct(TrackName, Artists, .keep_all = TRUE) #removes 9575 tracks (23%)
data_Ayahuasca <- data_Ayahuasca %>% distinct(TrackName, Artists, .keep_all = TRUE) #removes 7743 tracks (52%)
#Step Four: Select the top 7000 tracks from each compound group, ranked by new Drug_index weight
top_7k_LSD <- as.data.frame(data_LSD) %>% slice_max(Drug_Index, n = 7000, with_ties = F)
top_7k_Aya <- as.data.frame(data_Ayahuasca) %>% slice_max(Drug_Index, n = 7000, with_ties = F)
#Step Five: Merge the dataframes back together to start the data analysis
CDS_data <- rbind.data.frame(top_7k_Aya, top_7k_LSD) #We now have a dataframe with 14,000 tracks
Exploratory data analysis was conducted to compare the two compounds. This took the form of: i) Checking the correlations ii) Boxplots iii) Density histograms iv) Checking the significance of differences using t-tests v) Scatter plots of interesting variables to see if the differences can be seen
#Packages needed:
#install.packages("ggplot2", "funModeling")
library(ggplot2)
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
#Getting our data into numerical form
#Make numerical dataframes with the musical features - overall and within each drug group
CDS_numerical <- CDS_data %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))
Ayahuasca_numerical <- top_7k_Aya %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))
LSD_numerical <- top_7k_LSD %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))
#Make the drug column into a factor to allow numerical plotting
CDS_numerical$Drug <- as.factor(CDS_numerical$Drug)
Ayahuasca_numerical$Drug <- as.factor(Ayahuasca_numerical$Drug)
LSD_numerical$Drug <- as.factor(LSD_numerical$Drug)
#Step One: Correlations
#Check the correlations among the variables using corrplot() function
library(corrplot)
## corrplot 0.84 loaded
correlations <- cor(CDS_numerical[,2:9]) #Make a variable excluding drug to compare the musical features
corrplot(correlations, method="circle")
##Results: Acousticness has strong negative correlation with danceability, energy, and loudness.
#Loudness has strong positive correlation with danceability and energy
#Danceability has a moderately strong positive correlation with energy
#Step Two: Boxplots
#Use the funModeling package and function plotar() to make boxplots of all the musical variables:
plotar(data=CDS_numerical, target="Drug", plot_type="boxplot")
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Results: Big differences seen in acousticness, energy and instrumentalness
#Moderate differences seen in danceability, loudness, tempo and valence
#Minimal differences seen in liveness and speechiness
#Step 3: Density Histograms
#Use the funModeling package and function plotar() to make density histograms to compare the distributions of the variables
plotar(data=CDS_numerical, target="Drug", plot_type="histdens")
## Warning: `summarise_()` is deprecated as of dplyr 0.7.0.
## Please use `summarise()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
##Results: Danceability and tempo have somewhat normal and similar distributions between the compound groups
#Acousticness, energy and instrumentalness show huge variation in means and distribution
#Valence shows a similar, moderately right-skewed distribution for both groups (skew = 0.56, kurt = 2.32)
#Liveness has a highly right-skewed but similar distribution (skew = 2.35, kurt = 8.91) for both compounds, while loudness has a left-skewed but similar (skew = -1.27, kurt = 5.20) distribution for both compounds.
#Step 4: Getting the descriptive statistics and significance of differences between each compound
profiling_num(CDS_numerical) #Stats of both groups together (overall subset stats)
## variable mean std_dev variation_coef p_01
## 1 acousticness 0.40467066 0.36864577 0.9109773 0.000024499
## 2 danceability 0.54331271 0.19404900 0.3571589 0.090097000
## 3 energy 0.53406577 0.27290588 0.5109968 0.022500000
## 4 instrumentalness 0.37405771 0.39147732 1.0465693 0.000000000
## 5 liveness 0.18837783 0.16450245 0.8732580 0.046999000
## 6 loudness -10.96507857 5.41746384 -0.4940652 -28.491150000
## 7 speechiness 0.07459249 0.08634546 1.1575624 0.025900000
## 8 tempo 120.88541257 29.43418603 0.2434883 64.655780000
## 9 valence 0.35861748 0.25161787 0.7016330 0.031700000
## p_05 p_25 p_50 p_75 p_95 p_99 skewness
## 1 0.00024895 0.025900 0.3090 0.78900 0.969 0.99100 0.28657941
## 2 0.18200000 0.412750 0.5680 0.68700 0.826 0.90601 -0.39552102
## 3 0.08479000 0.316000 0.5350 0.76300 0.955 0.99200 -0.07312362
## 4 0.00000000 0.000114 0.1650 0.82000 0.928 0.96300 0.31021774
## 5 0.06560000 0.096275 0.1170 0.21800 0.566 0.87604 2.34931960
## 6 -21.63740000 -13.546000 -9.7355 -7.11975 -4.536 -2.75882 -1.27252291
## 7 0.02840000 0.035200 0.0446 0.06870 0.259 0.44401 3.90848089
## 8 75.37285000 97.231000 122.0185 140.04700 171.846 189.11574 0.14115633
## 9 0.03820000 0.144000 0.3160 0.53500 0.835 0.94701 0.56394702
## kurtosis iqr range_98 range_80
## 1 1.449082 0.763100 [2.4499e-05, 0.991] [0.001269, 0.932]
## 2 2.500826 0.274250 [0.090097, 0.90601] [0.255, 0.782]
## 3 1.932052 0.447000 [0.0225, 0.992] [0.149, 0.9131]
## 4 1.282778 0.819886 [0, 0.963] [0, 0.901]
## 5 8.913906 0.121725 [0.046999, 0.876040000000001] [0.0767, 0.382]
## 6 5.199263 6.426250 [-28.49115, -2.75882] [-18.332, -5.4179]
## 7 23.274619 0.033500 [0.0259, 0.44401] [0.0303, 0.148]
## 8 2.657559 42.816000 [64.65578, 189.11574] [81.3009, 159.992]
## 9 2.323957 0.391000 [0.0317, 0.94701] [0.05519, 0.739]
profiling_num(Ayahuasca_numerical) #Ayahuasca means, sd's, skew and kurtosis
## variable mean std_dev variation_coef p_01
## 1 acousticness 0.6394725 0.31199978 0.4879018 0.0020998
## 2 danceability 0.4781168 0.19626932 0.4105050 0.0774990
## 3 energy 0.3695685 0.21920621 0.5931409 0.0138980
## 4 instrumentalness 0.3027125 0.38129281 1.2595873 0.0000000
## 5 liveness 0.1729666 0.15159665 0.8764502 0.0527960
## 6 loudness -13.5944813 5.65375396 -0.4158860 -31.1464200
## 7 speechiness 0.0606477 0.08020739 1.3225134 0.0254000
## 8 tempo 114.5727894 30.76404990 0.2685110 61.3111100
## 9 valence 0.3253003 0.25283776 0.7772442 0.0297000
## p_05 p_25 p_50 p_75 p_95 p_99 skewness
## 1 0.037595 0.4030000 0.75000 0.905250 0.98200 0.99400 -0.7143363
## 2 0.142000 0.3257500 0.49900 0.626250 0.77805 0.86001 -0.1789691
## 3 0.049795 0.1920000 0.35000 0.522000 0.77100 0.89600 0.4032506
## 4 0.000000 0.0000226 0.02135 0.745000 0.93800 0.97001 0.7021085
## 5 0.069800 0.0957000 0.11200 0.177000 0.47005 0.88700 2.8065340
## 6 -24.446550 -16.6830000 -12.55800 -9.523750 -6.20490 -4.56383 -0.9962732
## 7 0.027700 0.0328000 0.03900 0.051525 0.16005 0.47301 5.7924811
## 8 71.948900 90.0237500 111.82150 134.978500 172.06105 191.37213 0.4390477
## 9 0.036500 0.1100000 0.26500 0.490000 0.83600 0.95201 0.7660715
## kurtosis iqr range_98 range_80
## 1 2.156412 0.5022500 [0.0020998, 0.994] [0.116, 0.966]
## 2 2.206481 0.3005000 [0.077499, 0.86001] [0.193, 0.725]
## 3 2.450312 0.3300000 [0.013898, 0.896] [0.09136, 0.676]
## 4 1.698618 0.7449774 [0, 0.97001] [0, 0.909]
## 5 11.899438 0.0813000 [0.052796, 0.887] [0.07879, 0.353]
## 6 4.258405 7.1592500 [-31.14642, -4.56383] [-21.3572, -7.3459]
## 7 44.781118 0.0187250 [0.0254, 0.47301] [0.0292, 0.0939]
## 8 2.818351 44.9547500 [61.31111, 191.37213] [77.9062, 157.9249]
## 9 2.610993 0.3800000 [0.0297, 0.95201] [0.0396, 0.7241]
profiling_num(LSD_numerical) #LSD means, sd's, skew and kurtosis
## variable mean std_dev variation_coef p_01
## 1 acousticness 0.16986884 0.25336818 1.4915518 0.000013697
## 2 danceability 0.60850867 0.16820083 0.2764148 0.168000000
## 3 energy 0.69856303 0.21630553 0.3096435 0.123990000
## 4 instrumentalness 0.44540293 0.38854205 0.8723383 0.000000000
## 5 liveness 0.20378901 0.17512669 0.8593530 0.042600000
## 6 loudness -8.33567586 3.59271422 -0.4310046 -20.561010000
## 7 speechiness 0.08853729 0.08994414 1.0158899 0.026400000
## 8 tempo 127.19803571 26.58439062 0.2090000 73.004930000
## 9 valence 0.39193466 0.24593663 0.6274940 0.035200000
## p_05 p_25 p_50 p_75 p_95 p_99 skewness
## 1 0.000073485 0.0027400 0.03585 0.23000 0.7870 0.95600 1.66377113
## 2 0.296000000 0.5070000 0.62800 0.73000 0.8540 0.92401 -0.50992324
## 3 0.300950000 0.5490000 0.72850 0.88300 0.9770 0.99600 -0.65183275
## 4 0.000000000 0.0015075 0.52050 0.84100 0.9190 0.95000 -0.04923239
## 5 0.061695000 0.0969750 0.12600 0.26225 0.6190 0.87204 2.00809722
## 6 -14.924000000 -9.9250000 -7.71300 -6.03775 -3.8239 -2.17084 -1.60242746
## 7 0.029600000 0.0398000 0.05380 0.08800 0.3010 0.43101 2.71360346
## 8 81.324950000 109.4750000 128.12800 144.02300 170.3813 187.69350 -0.05554600
## 9 0.047295000 0.1820000 0.36200 0.56900 0.8340 0.94400 0.39875011
## kurtosis iqr range_98 range_80
## 1 4.700480 0.2272600 [1.3697e-05, 0.956] [0.0002597, 0.599]
## 2 2.946297 0.2230000 [0.168, 0.92401] [0.373, 0.808]
## 3 2.763896 0.3340000 [0.12399, 0.996] [0.397, 0.954]
## 4 1.193089 0.8394925 [0, 0.95] [0, 0.896]
## 5 7.052634 0.1652750 [0.0426, 0.872040000000001] [0.07399, 0.417]
## 6 9.228778 3.8872500 [-20.56101, -2.17084] [-12.6613, -4.6639]
## 7 11.826978 0.0482000 [0.0264, 0.43101] [0.0327, 0.211]
## 8 2.835613 34.5480000 [73.00493, 187.6935] [89.9908, 160.033]
## 9 2.184401 0.3870000 [0.0352, 0.944] [0.0782, 0.753]
#Run independent t-tests between the two groups to test whether the differences are statistically significant (or whether they just emerged by chance). Our significance threshold is p < .05
#Significance T-tests
t.test(CDS_numerical$acousticness ~ CDS_numerical$Drug, var.equal = TRUE) #Acousticness:t = 97.8, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$acousticness by CDS_numerical$Drug
## t = 97.756, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4601874 0.4790198
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.6394725 0.1698688
t.test(CDS_numerical$danceability ~ CDS_numerical$Drug, var.equal = TRUE) #Danceability: t = -42.2, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$danceability by CDS_numerical$Drug
## t = -42.205, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1364477 -0.1243362
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.4781168 0.6085087
t.test(CDS_numerical$energy ~ CDS_numerical$Drug, var.equal = TRUE) #Energy: t = -89.4, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$energy by CDS_numerical$Drug
## t = -89.381, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3362094 -0.3217796
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.3695685 0.6985630
t.test(CDS_numerical$instrumentalness ~ CDS_numerical$Drug, var.equal = TRUE) #Instru: t = -21.9, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$instrumentalness by CDS_numerical$Drug
## t = -21.93, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1554442 -0.1299367
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.3027125 0.4454029
t.test(CDS_numerical$liveness ~ CDS_numerical$Drug, var.equal = TRUE) #Live: t = -11.1, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$liveness by CDS_numerical$Drug
## t = -11.133, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03624894 -0.02539580
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.1729666 0.2037890
t.test(CDS_numerical$loudness ~ CDS_numerical$Drug, var.equal = TRUE) #Loud: t = -65.7, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$loudness by CDS_numerical$Drug
## t = -65.682, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.415743 -5.101868
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## -13.594481 -8.335676
t.test(CDS_numerical$speechiness ~ CDS_numerical$Drug, var.equal = TRUE) #Speech: t = -19.4, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$speechiness by CDS_numerical$Drug
## t = -19.362, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03071295 -0.02506622
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.06064770 0.08853729
t.test(CDS_numerical$tempo ~ CDS_numerical$Drug, var.equal = TRUE) #Tempo : t = -25.98, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$tempo by CDS_numerical$Drug
## t = -25.98, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.57781 -11.67268
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 114.5728 127.1980
t.test(CDS_numerical$valence ~ CDS_numerical$Drug, var.equal = TRUE) #Valence: t = -15.8, p < .05
##
## Two Sample t-test
##
## data: CDS_numerical$valence by CDS_numerical$Drug
## t = -15.806, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07489792 -0.05837080
## sample estimates:
## mean in group Ayahuasca mean in group LSD
## 0.3253003 0.3919347
##Results are all summarised in Table X of the final project write up
#Step 5: Scatter Plots
#Scatter 1: Acousticness and energy are highly correlated, with big differences between the two means. Let's plot a scatter plot to observe the relationship in these 2 variables between the compound groups:
ggplot(CDS_numerical, aes(x = acousticness, y = energy)) +
geom_point(aes(colour = Drug, alpha=0.01)) +
scale_color_manual(values = c("#CA3C97", "#EDD9A3")) +
theme_bw() +
theme(
plot.title = element_text(size=12)
) + theme(panel.border = element_blank()) +
labs(x = "Acousticness", y = "Energy", title = "Scatterplot of Acousticness and Energy between Compound Groups") +
theme(axis.text.x = element_text(face = 'bold', size = 8),
axis.text.y = element_text(face = 'bold', size = 8))
#Comments: Here the difference between the compound groups is clearly visible: the LSD tracks cluster around the higher end of energy with low acousticness, while the Ayahuasca tracks cluster at the higher end of acousticness and the lower end of energy
#Scatter 2: Valence and Danceability show smaller but still significant differences between the two means. We'll also visualise these using a scatterplot:
ggplot(CDS_numerical, aes(x = valence, y = danceability)) +
geom_point(aes(colour = Drug, alpha=0.01)) +
scale_color_manual(values = c("#CA3C97", "#EDD9A3")) +
theme_bw() +
theme(
plot.title = element_text(size=12)
) + theme(panel.border = element_blank()) +
labs(x = "Valence", y = "Danceability", title = "Scatterplot of Valence and Danceability between Compound Groups") +
theme(axis.text.x = element_text(face = 'bold', size = 8),
axis.text.y = element_text(face = 'bold', size = 8))
#Comments: The differences here are less visible, but we still see distinct clusters of LSD tracks at the higher end of both valence and danceability, while Ayahuasca is spread much more widely, with many tracks clustering at the high end of danceability but the lower end of valence.
It’s clear that musical differences exist between the two groups, but we want to go beyond significance testing and see how well a classifier can distinguish between the two groups based only on their musical features. Classifiers are typically considered good if they achieve an accuracy above 70%. The higher the accuracy, the better the selected variables are able to distinguish or explain the two groups. In this context, the classifier will tell us how well the musical features define each compound group, and thus the musical choices of the culture behind the compound. The higher the accuracy, the more confidence we can have in inferring why the musical features distribute as they do between the two cultures. Let’s jump in!
Classification is the process of predicting categorical outcomes given input data points: an algorithm compares the input variables and maps each input to a specific category. Here our categories are the compound groups of Ayahuasca (representing our shamanic culture) and LSD (representing our psychedelia culture), and our input variables are the musical features. We are not especially interested in the extent to which each variable explains the categories; rather, we are testing whether enough difference exists to support accurate classification, which could guide further investigations and support the idea that fundamental differences between the cultures exist.
Two approaches will be used to build a classifier, and the accuracy of each will be compared: i) Binomial Logistic Regression Classification ii) Random Forest Classification
Binomial logistic regression is a fundamental algorithm for supervised machine learning classification. It works by fitting datapoints to a curve, much like linear regression, only because the outcome is categorical or binary, the datapoints are fitted along a sigmoid “S”-shaped curve, clustering at the top (1) and bottom (0) of the curve. Binomial logistic regression is a good starting point for exploring the possibility of classification.
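As a minimal sketch of that sigmoid mapping (using illustrative values, not our data), base R's plogis() computes the logistic function that squashes any linear-predictor value into a probability between 0 and 1:

```r
# plogis(x) = 1 / (1 + exp(-x)): the sigmoid "S" curve
x <- c(-6, -2, 0, 2, 6)  # hypothetical linear-predictor values
round(plogis(x), 3)      # strongly negative values map near 0, strongly
                         # positive values near 1, and 0 maps to exactly 0.5
```

A fitted glm() with family = binomial() applies exactly this transformation to its linear predictor to produce class probabilities.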
A random forest classifier works on the concept of the wisdom of the crowd. It trains a number of decision trees on subsets of the training set; each tree predicts which class an observation belongs to, and the class prediction with the most votes becomes the model’s prediction. It is one of the most accurate classification methods and is widely popular within machine learning.
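The majority-vote idea can be illustrated with a toy example (the votes below are hypothetical, not model output):

```r
# Five hypothetical trees each cast a class vote for one track;
# the forest predicts the class with the most votes
tree_votes <- c("LSD", "Ayahuasca", "LSD", "LSD", "Ayahuasca")
prediction <- names(which.max(table(tree_votes)))
prediction  # "LSD" (3 votes to 2)
```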
The Binomial Logistic Regression Classifier:
set.seed(999) #this allows for replication
#To keep the logistic regression model simple and interpretable, the 5 musical variables with the greatest variance between groups were chosen as predictors. These were: acousticness, energy, instrumentalness, danceability, and tempo
#The data
BLR_Data <- CDS_numerical %>% select(c(acousticness, energy, instrumentalness, danceability, tempo, Drug))
BLR_Data$Drug <- as.factor(BLR_Data$Drug)
#The functions
#createDataPartition() - splits data into train and test sets based on our target (Drug)
#glm() - this is one of the most versatile of models to use for logistic reg in R
# Loading caret library - this supports the splitting of data
require(caret)
## Loading required package: caret
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
# Splitting the data into train (70%) and test (30%) sets
BLR_Model_index <- createDataPartition(BLR_Data$Drug, p = .70, list = FALSE)
BLR_Model_train <- BLR_Data[BLR_Model_index, ]
BLR_Model_test <- BLR_Data[-BLR_Model_index, ]
# Training the model - AIC = 8242
BLR_Model <- glm(Drug ~ ., family = binomial(), data = BLR_Model_train)
# Checking the model
summary(BLR_Model) #There is a big difference between the null (13586) and residual (8230) deviance, indicating that our predictors are in fact telling us more than a null model
##
## Call:
## glm(formula = Drug ~ ., family = binomial(), data = BLR_Model_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.53172 -0.58409 0.06151 0.61796 2.80460
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0035681 0.1798262 -11.142 < 2e-16 ***
## acousticness -2.7806421 0.1041778 -26.691 < 2e-16 ***
## energy 3.4142884 0.1544653 22.104 < 2e-16 ***
## instrumentalness 0.8698839 0.0745037 11.676 < 2e-16 ***
## danceability 1.2810164 0.1580537 8.105 5.28e-16 ***
## tempo 0.0018117 0.0009767 1.855 0.0636 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13585.7 on 9799 degrees of freedom
## Residual deviance: 8229.5 on 9794 degrees of freedom
## AIC: 8241.5
##
## Number of Fisher Scoring iterations: 5
#Convert the coefficients to odds ratios - to make them more interpretable
exp(coef(BLR_Model))
## (Intercept) acousticness energy instrumentalness
## 0.13485325 0.06199869 30.39531204 2.38663386
## danceability tempo
## 3.60029715 1.00181332
# Creating predictions in the test dataset
pred_BLR <- predict(BLR_Model, BLR_Model_test, type = "response")
#Classification Table - First Training set and then test
# Converting from probability to actual output
BLR_Model_train$pred_drug <- ifelse(BLR_Model$fitted.values >= 0.5, "LSD", "Ayahuasca")
# Generating the classification table
ctab_train <- table(BLR_Model_train$Drug, BLR_Model_train$pred_drug)
ctab_train
##
## Ayahuasca LSD
## Ayahuasca 3915 985
## LSD 831 4069
# Converting from probability to actual output
BLR_Model_test$pred_drug <- ifelse(pred_BLR >= 0.5, "LSD", "Ayahuasca")
# Generating the classification table
ctab_test <- table(BLR_Model_test$Drug, BLR_Model_test$pred_drug)
ctab_test
##
## Ayahuasca LSD
## Ayahuasca 1655 445
## LSD 383 1717
#Accuracy = (TP + TN)/(TN + FP + FN + TP)
#Train
accuracy_train <- sum(diag(ctab_train))/sum(ctab_train)*100
accuracy_train # Accuracy in Training dataset - is 81.5% (81.46)
## [1] 81.46939
# Test
accuracy_test <- sum(diag(ctab_test))/sum(ctab_test)*100
accuracy_test #Accuracy in Test dataset - is also 80.3% (80.29)
## [1] 80.28571
#The train and test accuracies are close, indicating that our model is performing well and not over-fitting. We are performing around 30 percentage points above chance
#Let's explore the True Positive rate (TPR) (ie the Sensitivity)
# Recall, or TPR, indicates how often our model predicts actual TRUE events out of all the TRUE events.
Recall_TR <- (ctab_train[2, 2]/sum(ctab_train[2, ]))*100
Recall_TR #Answer: Recall in Train dataset - This is 83.0%
## [1] 83.04082
# True Negative Rate (TNR) in Train dataset
#TNR indicates how often our model predicts actual nonevents out of all the nonevents
TNR <- (ctab_train[1, 1]/sum(ctab_train[1, ]))*100
TNR #Answer: This happens 79.9% of the time
## [1] 79.89796
# Precision in Train dataset (i.e. of the tracks the model predicts as a given drug, how often the prediction is correct)
Precision <- (ctab_train[2, 2]/sum(ctab_train[, 2]))*100
Precision #Answer: This happens 80.5% of the time
## [1] 80.51049
#Calculating the F-score - the harmonic mean of recall and precision. The score lies between 0 and 1: 1 represents perfect precision and recall, 0 the worst case.
F_Score <- (2 * Precision * Recall_TR / (Precision + Recall_TR))/100
F_Score #This is 0.82 which is pretty good!
## [1] 0.8175608
#Calculating the AUC value and ROC curve - this gives us an indication of how good our model is
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_BLR <- roc(BLR_Model_train$Drug, BLR_Model$fitted.values)
## Setting levels: control = Ayahuasca, case = LSD
## Setting direction: controls < cases
auc(roc_BLR) #The area under the curve is .89, which is pretty darn good! (the value runs from 0 to 1, and the closer to 1 the better the model)
## Area under the curve: 0.8902
plot(roc_BLR) #This plots our roc values and we see the curve is far away from the diagonal indicating it is a good model
Conclusion: Our binomial logistic regression model can predict which compound group a track belongs to with an accuracy around 80% using the musical features of acousticness, energy, instrumentalness, danceability, and tempo. This is 30 percentage points above chance level (50%). Our train and test sets show similar accuracy, indicating that the model is not over-fitting, and our statistical tests of model evaluation indicate that the classifier has good levels of sensitivity, specificity, and AUC. This is promising and supports that musical differences exist which are strong enough to perhaps tell us something more about the culture they belong to.
Random Forest Classifier: This classifier is designed to handle many predictor variables, so all musical variables can be used. The randomForest package in R even calculates how important each variable is for predicting which compound group a track belongs to.
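As a sketch of how that importance output is obtained (using a small simulated data frame as a stand-in for our track features, not the real data), the randomForest package's importance() and varImpPlot() report per-variable measures:

```r
library(randomForest)
set.seed(1)
# Simulated stand-in for the track features (values are illustrative only)
toy <- data.frame(
  acousticness = runif(200),
  energy       = runif(200),
  Drug         = factor(sample(c("Ayahuasca", "LSD"), 200, replace = TRUE))
)
# importance = TRUE asks the forest to record permutation-based importance
rf_toy <- randomForest(Drug ~ ., data = toy, importance = TRUE, ntree = 100)
importance(rf_toy)    # MeanDecreaseAccuracy and MeanDecreaseGini per feature
# varImpPlot(rf_toy)  # dot-chart view of the same importance measures
```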
set.seed(777)
#Load packages needed to build and evaluate the models
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
#The data - We can use our CDS_numerical dataset
#Split the data into a training and test set
RF_split = sample(2, nrow(CDS_numerical), replace=TRUE, prob=c(0.7,0.3))
train_RF = CDS_numerical[RF_split==1,]
test_RF = CDS_numerical[RF_split==2,]
#Cross-validating the model: We're going to cross-validate 10 times to ensure our results are accurate
#trControl defines 10-fold cross-validation for the tuning runs below; note that the initial mtry evaluation passes caret's default trainControl(), so it resamples by bootstrap (as its output shows)
trControl <- trainControl(method = "cv",
number = 10,
search = "grid")
trEVALUATE <- train(Drug~., train_RF, method = "rf", metric= "Accuracy", trControl = trainControl(), tuneGrid = NULL)
#print results
print(trEVALUATE) # Our evaluation tells us the optimal mtry to use is 2 (Accuracy: .83, Kappa = .66) - the model tested mtry of 2, 5 and 9.
## Random Forest
##
## 9829 samples
## 9 predictor
## 2 classes: 'Ayahuasca', 'LSD'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 9829, 9829, 9829, 9829, 9829, 9829, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8280488 0.6558644
## 5 0.8252217 0.6502185
## 9 0.8226500 0.6450957
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
#Search the best MaxNodes
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = 2)
for (maxnodes in c(5: 15)) {
set.seed(1234)
rf_maxnode <- train(Drug~.,
data = train_RF,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = maxnodes,
ntree = 300)
current_iteration <- toString(maxnodes)
store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry) #Highest mean accuracy achieved around 14-15 nodes, so 14 nodes were used (Accuracy = .82, Kappa = .63)
##
## Call:
## summary.resamples(object = results_mtry)
##
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.7800407 0.8062055 0.8112920 0.8088281 0.8153611 0.8270600 0
## 6 0.7881874 0.8031536 0.8112920 0.8100497 0.8207019 0.8229908 0
## 7 0.7902240 0.8018820 0.8153611 0.8121862 0.8232452 0.8270600 0
## 8 0.7871690 0.8059512 0.8153611 0.8137118 0.8245168 0.8341811 0
## 9 0.7934893 0.8077314 0.8123093 0.8148319 0.8265514 0.8392675 0
## 10 0.7861507 0.8128179 0.8173957 0.8155428 0.8247711 0.8341811 0
## 11 0.7892057 0.8090031 0.8179044 0.8148310 0.8227365 0.8321465 0
## 12 0.7932790 0.8095117 0.8199390 0.8163574 0.8260427 0.8351984 0
## 13 0.7922607 0.8110376 0.8194303 0.8168659 0.8275687 0.8341811 0
## 14 0.7932790 0.8112920 0.8184130 0.8160522 0.8245168 0.8351984 0
## 15 0.7973523 0.8120549 0.8199390 0.8181889 0.8278230 0.8341811 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.5598843 0.6121235 0.6223686 0.6174918 0.6305921 0.6542027 0
## 6 0.5761638 0.6061610 0.6225068 0.6199978 0.6412308 0.6461791 0
## 7 0.5800718 0.6036546 0.6305607 0.6242263 0.6464040 0.6542499 0
## 8 0.5739670 0.6116484 0.6305960 0.6271894 0.6486858 0.6683131 0
## 9 0.5868234 0.6151343 0.6244467 0.6294447 0.6528155 0.6785241 0
## 10 0.5720245 0.6253927 0.6345150 0.6308624 0.6493575 0.6682980 0
## 11 0.5780228 0.6176411 0.6355807 0.6294271 0.6452330 0.6642433 0
## 12 0.5860945 0.6187396 0.6397256 0.6324577 0.6517691 0.6703254 0
## 13 0.5840659 0.6218116 0.6386877 0.6334839 0.6548511 0.6684188 0
## 14 0.5861564 0.6222874 0.6366604 0.6318554 0.6487657 0.6704005 0
## 15 0.5942503 0.6238526 0.6396691 0.6361073 0.6553971 0.6680258 0
#Search the best number of ntrees
store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
  set.seed(589)
  rf_maxtrees <- train(Drug ~ .,
                       data = train_RF,
                       method = "rf",
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trControl,
                       importance = TRUE,
                       maxnodes = 14,
                       ntree = ntree)
  key <- toString(ntree)
  store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree) #350 trees gives near-peak accuracy at a fraction of the cost of the larger forests, so ntree = 350 has been used (Accuracy = .82, Kappa = .64)
##
## Call:
## summary.resamples(object = results_tree)
##
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.7822991 0.8123093 0.8199390 0.8170726 0.8239633 0.8392675 0
## 300 0.7833164 0.8120549 0.8179044 0.8172760 0.8259977 0.8413021 0
## 350 0.7812818 0.8120549 0.8218834 0.8179881 0.8255341 0.8423194 0
## 400 0.7853510 0.8138352 0.8189217 0.8181917 0.8257439 0.8402848 0
## 450 0.7833164 0.8148525 0.8188310 0.8176828 0.8247711 0.8423194 0
## 500 0.7853510 0.8140895 0.8203580 0.8184968 0.8255341 0.8423194 0
## 550 0.7833164 0.8130722 0.8184130 0.8173780 0.8259984 0.8402848 0
## 600 0.7843337 0.8130722 0.8184130 0.8175815 0.8254012 0.8402848 0
## 800 0.7833164 0.8130722 0.8184130 0.8176832 0.8259984 0.8402848 0
## 1000 0.7873856 0.8123093 0.8194303 0.8181922 0.8276925 0.8392675 0
## 2000 0.7853510 0.8140895 0.8204476 0.8188024 0.8276917 0.8402848 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.5641069 0.6242844 0.6395625 0.6338813 0.6477354 0.6784070 0
## 300 0.5661735 0.6237479 0.6355795 0.6342905 0.6517592 0.6824778 0
## 350 0.5621000 0.6237952 0.6435073 0.6357150 0.6507783 0.6845060 0
## 400 0.5702274 0.6273098 0.6376146 0.6361171 0.6511815 0.6804351 0
## 450 0.5661339 0.6293674 0.6373466 0.6350938 0.6493484 0.6844916 0
## 500 0.5702274 0.6278507 0.6404072 0.6367304 0.6508300 0.6844916 0
## 550 0.5661339 0.6257716 0.6366091 0.6344911 0.6517170 0.6804351 0
## 600 0.5681610 0.6258272 0.6365837 0.6348902 0.6505406 0.6804205 0
## 800 0.5661735 0.6258187 0.6365025 0.6350983 0.6518359 0.6804205 0
## 1000 0.5743398 0.6242848 0.6386030 0.6361236 0.6551112 0.6783923 0
## 2000 0.5702470 0.6278422 0.6406467 0.6373410 0.6551141 0.6804205 0
#Generate the final random forest using the tuned hyperparameters
library(randomForest) # needed for the direct randomForest() call
Model_RF <- randomForest(Drug ~ ., data = train_RF, ntree = 350, mtry = 2, proximity = TRUE)
table(predict(Model_RF), train_RF$Drug)
##
## Ayahuasca LSD
## Ayahuasca 4236 954
## LSD 734 3905
Model_RF # OOB estimate of error rate = 17.17% - pretty good!
##
## Call:
## randomForest(formula = Drug ~ ., data = train_RF, ntree = 350, mtry = 2, proximity = T)
## Type of random forest: classification
## Number of trees: 350
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.17%
## Confusion matrix:
## Ayahuasca LSD class.error
## Ayahuasca 4236 734 0.1476861
## LSD 954 3905 0.1963367
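As a sanity check, the printed OOB error rate can be recovered by hand from the confusion matrix above. A minimal base-R sketch, using the counts copied from the output rather than recomputed from the model:

```r
# Reproduce the OOB error rate from the printed confusion matrix counts
conf <- matrix(c(4236, 734,
                 954, 3905),
               nrow = 2, byrow = TRUE,
               dimnames = list(Actual = c("Ayahuasca", "LSD"),
                               Predicted = c("Ayahuasca", "LSD")))
oob_error <- 1 - sum(diag(conf)) / sum(conf)  # misclassified / total
round(oob_error, 4)  # 0.1717, matching the reported 17.17%
```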
plot(Model_RF) #This plot shows the OOB error rate overall and for each drug class - the model doesn't seem to improve much after 150 trees
importance(Model_RF)#Here the most important variables are acousticness, energy, loudness, speechiness
## MeanDecreaseGini
## acousticness 1209.5065
## danceability 365.3115
## energy 840.8479
## instrumentalness 368.6797
## liveness 267.1778
## loudness 722.1561
## speechiness 447.8585
## tempo 377.6408
## valence 303.5572
varImpPlot(Model_RF) #Plot of the variables importance
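To make the ordering behind varImpPlot() explicit, here is a small sketch ranking the MeanDecreaseGini values, hard-coded from the importance() output above purely for illustration:

```r
# Rank the Gini importances printed by importance(Model_RF)
gini <- c(acousticness = 1209.51, danceability = 365.31, energy = 840.85,
          instrumentalness = 368.68, liveness = 267.18, loudness = 722.16,
          speechiness = 447.86, tempo = 377.64, valence = 303.56)
names(sort(gini, decreasing = TRUE))[1:4]
# "acousticness" "energy" "loudness" "speechiness"
```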
#Let's test how good our model is:
Pred_RF = predict(Model_RF, newdata=test_RF)
table(Pred_RF, test_RF$Drug)
##
## Pred_RF Ayahuasca LSD
## Ayahuasca 1763 428
## LSD 267 1713
confusionMatrix(Pred_RF, test_RF$Drug) #This reports all the metrics needed to evaluate the predictions
## Confusion Matrix and Statistics
##
## Reference
## Prediction Ayahuasca LSD
## Ayahuasca 1763 428
## LSD 267 1713
##
## Accuracy : 0.8334
## 95% CI : (0.8217, 0.8446)
## No Information Rate : 0.5133
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6672
##
## Mcnemar's Test P-Value : 1.286e-09
##
## Sensitivity : 0.8685
## Specificity : 0.8001
## Pos Pred Value : 0.8047
## Neg Pred Value : 0.8652
## Prevalence : 0.4867
## Detection Rate : 0.4227
## Detection Prevalence : 0.5253
## Balanced Accuracy : 0.8343
##
## 'Positive' Class : Ayahuasca
##
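The headline numbers from confusionMatrix() can be verified directly from the cell counts. A minimal sketch, using the test-set confusion matrix above with Ayahuasca as the positive class:

```r
# Recompute accuracy, sensitivity and specificity from the confusion matrix
TP <- 1763; FP <- 428   # predicted Ayahuasca (positive class)
FN <- 267;  TN <- 1713  # predicted LSD
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)  # true-positive rate for Ayahuasca
specificity <- TN / (TN + FP)  # true-negative rate
round(c(accuracy, sensitivity, specificity), 4)
# 0.8334 0.8685 0.8001 - matching the confusionMatrix() output
```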
#Plot the classification margins (a positive margin means the case is classified correctly)
plot(margin(Model_RF, test_RF$Drug)) # an upwards slope with mostly positive margins, which looks good
## Warning in RColorBrewer::brewer.pal(nlevs, "Set1"): minimal value for n is 3, returning requested palette with 3 different levels
#Calculating the AUC value and ROC curve - this gives us an indication of how good our model is
#AUC Calculation
library(pROC)
library(ROCR)
rf_p <- predict(Model_RF, type="prob")[,2] # out-of-bag class probabilities for the training data
rf_pr <- prediction(rf_p, train_RF$Drug)
r_auc <- performance(rf_pr, measure = "auc")@y.values[[1]]
r_auc # 0.91 - recall this should be as close to 1 as possible, so .91 is a strong result!
## [1] 0.908265
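Since ROCR returns only the scalar AUC here, it may help to see what that number summarises. The toy sketch below (hypothetical scores and labels, not the model's OOB probabilities) traces the ROC curve by sorting the scores and integrates it with the trapezoid rule - the same quantity performance(rf_pr, measure = "auc") computes:

```r
# Hand-rolled ROC/AUC in base R on toy data (hypothetical stand-ins for the
# OOB probabilities and Drug labels used above)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 1, 0, 1, 0, 0, 0)   # 1 = positive class
ord <- order(scores, decreasing = TRUE)
tpr <- cumsum(labels[ord]) / sum(labels)          # true-positive rate per cutoff
fpr <- cumsum(1 - labels[ord]) / sum(1 - labels)  # false-positive rate per cutoff
tpr <- c(0, tpr); fpr <- c(0, fpr)                # start the curve at the origin
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # trapezoid rule
auc  # 0.9375 for this toy data
```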
Conclusions: The random forest classifier, which can handle all of the predictor variables simultaneously, improved on the accuracy of the binomial logistic regression model by around 3%. It classifies test tracks at 83% accuracy, roughly 32 percentage points above the no-information rate (51%), which is even more promising for our investigation. Further, the random forest revealed that the variables most useful for prediction were in fact not only the ones which showed the greatest variance in the exploratory boxplots, but also included speechiness and loudness. This is a reminder of how machine learning approaches such as classification can improve on exploratory data analysis and reveal trends not so readily seen by eye.